Parameter Learning in PRISM Programs with 

Abstract

Probabilistic Logic Programming (PLP), exemplified 
by Sato and Kameya's PRISM, Poole's ICL, De Raedt 
et al's ProbLog and Vennekens et al's LPAD, combines 
statistical and logical knowledge representation and in- 
ference. Inference in these languages is based on enu- 
merative construction of proofs over logic programs. 
Consequently, these languages permit very limited use 
of random variables with continuous distributions. In 
this paper, we extend PRISM with Gaussian random
variables and linear equality constraints, and consider 
the problem of parameter learning in the extended lan- 
guage. Many statistical models such as finite mixture 
models and Kalman filters can be encoded in extended
PRISM. Our EM-based learning algorithm uses a sym- 
bolic inference procedure that represents sets of deriva- 
tions without enumeration. This permits us to learn 
the distribution parameters of extended PRISM pro- 
grams with discrete as well as Gaussian variables. The 
learning algorithm naturally generalizes the ones used 
for PRISM and Hybrid Bayesian Networks. 

Introduction 

Probabilistic Logic Programming (PLP) is a class of Statistical
Relational Learning (SRL) frameworks (Getoor and Taskar 2007)
which combine statistical and logical knowledge representation
and inference. PLP languages, such as SLP (Muggleton 1996),
ICL (Poole 2008), PRISM (Sato and Kameya 1997), ProbLog
(De Raedt, Kimmig, and Toivonen 2007) and LPAD (Vennekens,
Verbaeten, and Bruynooghe 2004), extend traditional logic
programming languages by implicitly or explicitly attaching
random variables to certain clauses in a logic program. A large
class of common statistical models, such as Bayesian networks,
Hidden Markov Models and Probabilistic Context-Free Grammars,
has been effectively encoded in PLP; the programming aspect of
PLP has also been exploited to succinctly specify complex
models, such as those for discovering links in biological
networks (De Raedt, Kimmig, and Toivonen 2007). Parameter
learning in these languages is typically done by variants of the
EM algorithm (Dempster, Laird, and Rubin 1977).

Operationally, combined statistical/logical inference in PLP is
based on proof structures similar to those created by pure
logical inference. As a result, these languages have limited
support for models with continuous random variables. Recently,
we extended PRISM (Sato and Kameya 1997) with Gaussian and
Gamma-distributed random variables, and linear equality
constraints (http://arxiv.org/abs/1112.2681).¹ This extension
permits encoding of complex statistical models including Kalman
filters and a large class of Hybrid Bayesian Networks.

In this paper, we present an algorithm for parame- 
ter learning in PRISM extended with Gaussian random 
variables. The key aspect of this algorithm is the con- 
struction of symbolic derivations that succinctly repre- 
sent large (sometimes infinite) sets of traditional logi- 
cal derivations. Our learning algorithm represents and 
computes Expected Sufficient Statistics (ESS) symbol- 
ically as well, for Gaussian as well as discrete random 
variables. Although our technical development is lim- 
ited to PRISM, the core algorithm can be adapted to 
parameter learning in (extended versions of) other PLP 
languages as well. 

Related Work SRL frameworks can be broadly classified as
statistical-model-based or logic-based, depending on how their
semantics is defined. In the first category are languages such as
Bayesian Logic Programs (BLPs) (Kersting and De Raedt 2001),
Probabilistic Relational Models (PRMs) (Narman et al. 2010), and
Markov Logic Networks (MLNs) (Richardson and Domingos 2006),
where logical relations are used to specify a model compactly.
Although originally defined over discrete random variables, these
languages have been extended (e.g., Continuous BLP (Kersting and
De Raedt 2001), Hybrid PRM (Narman et al. 2010), and Hybrid MLN
(Wang and Domingos 2008)) to support continuous random variables
as well. Techniques for parameter learning in
statistical-model-based languages are adapted from the
corresponding techniques in the underlying statistical models.
For example, discriminative learning techniques are used for
parameter learning in MLNs (Singla and Domingos 2005; Lowd and
Domingos 2007).

¹ Relevant technical aspects of this extension are summarized in
this paper to make it self-contained.

Logic-based SRL languages include the PLP languages mentioned
earlier. Hybrid ProbLog (Gutmann, Jaeger, and De Raedt 2010)
extends ProbLog by adding continuous probabilistic facts, but
restricts their use such that statistical models such as Kalman
filters and certain classes of Hybrid Bayesian Networks (with
continuous child with continuous parents) cannot be encoded.
More recently, Gutmann et al. (2011) introduced a sampling-based
approach for (approximate) probabilistic inference in a
ProbLog-like language.

Graphical EM (Sato and Kameya 1999) is the parameter learning
algorithm used in PRISM. Interestingly, graphical EM reduces to
the Baum-Welch (Rabiner 1989) algorithm for HMMs encoded in
PRISM. Gutmann et al. (2008) introduced a least squares
optimization approach to learn distribution parameters in
ProbLog. CoPrEM (Gutmann, Thon, and De Raedt 2011) is another
algorithm for ProbLog that computes binary decision diagrams
(BDDs) for representing proofs and uses a dynamic programming
approach to estimate parameters. BO-EM (Ishihata et al. 2010) is
a BDD-based parameter learning algorithm for PRISM. These
techniques enumerate derivations (even when represented as BDDs),
and do not readily generalize when continuous random variables
are introduced.

Background: An Overview of PRISM 

PRISM programs have Prolog-like syntax (see Example 1). In a
PRISM program the msw relation ("multi-valued switch") has a
special meaning: msw(X,I,V) says that V is a random variable.
More precisely, V is the outcome of the I-th instance from a
family X of random processes.² The variables in the set
$\{V_i \mid \mathtt{msw}(p, i, V_i)\}$ are i.i.d., and their
distribution is given by the random process p. The msw relation
provides the mechanism for using random variables, thereby
allowing us to weave together statistical and logical aspects of
a model into a single program. The distribution parameters of the
random variables are specified separately.

PRISM programs have declarative semantics, called distribution
semantics (Sato and Kameya 1997; Sato and Kameya 1999).
Operationally, query evaluation in PRISM closely follows that
for traditional logic programming, with one modification. When
the goal selected at a step is of the form msw(X,I,Y), then Y is
bound to a possible outcome of a random process X. The derivation
step is associated with the probability of this outcome. If all
random processes encountered in a derivation are independent,
then the probability of the derivation is the product of
probabilities of each step in the derivation. If a set of
derivations are pairwise mutually exclusive, the probability of
the set is the sum of probabilities of each derivation in the
set.³ Finally, the probability of an answer to a query is
computed as the probability of the set of derivations
corresponding to that answer.

As an illustration, consider the query fmix(X) evaluated over the
program in Example 1. One step of resolution derives a goal of
the form msw(m, M), msw(w(M), X). Now, depending on the outcome
of m, there are two possible next steps: msw(w(a), X) and
msw(w(b), X). Thus in PRISM, derivations are constructed by
enumerating the possible outcomes of each random variable.

Example 1 (Finite Mixture Model) In the following PRISM program,
which encodes a finite mixture model (McLachlan and Peel 2001),
msw(m, M) chooses one distribution from a finite set of
continuous distributions, and msw(w(M), X) samples X from the
chosen distribution.

    fmix(X) :- msw(m, M),
               msw(w(M), X).

    % Ranges of RVs
    values(m, [a,b]).
    values(w(M), real).

    % PDFs and PMFs
    :- set_sw(m, [0.3, 0.7]),
       set_sw(w(a), norm(2.0, 1.0)),
       set_sw(w(b), norm(3.0, 1.0)).
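
The generative reading of this program can be mimicked outside PRISM. The following Python sketch is only an illustration of what fmix(X) describes, not part of extended PRISM; the names SWITCH_M, SWITCH_W and sample_fmix are hypothetical, and the parameters are those used for w(a) and w(b) in Example 2.

```python
import random

# Parameters of the two switches (assumed from Examples 1 and 2).
SWITCH_M = {"a": 0.3, "b": 0.7}                  # PMF of msw(m, M)
SWITCH_W = {"a": (2.0, 1.0), "b": (3.0, 1.0)}    # (mean, variance) of msw(w(M), X)


def sample_fmix(rng):
    """Draw one (M, X) pair the way fmix(X) describes it:
    first choose the component M, then sample X from w(M)."""
    m = rng.choices(list(SWITCH_M), weights=list(SWITCH_M.values()))[0]
    mean, var = SWITCH_W[m]
    x = rng.gauss(mean, var ** 0.5)   # gauss expects a standard deviation
    return m, x


rng = random.Random(0)
draws = [sample_fmix(rng) for _ in range(20000)]
frac_a = sum(1 for m, _ in draws if m == "a") / len(draws)
mean_x = sum(x for _, x in draws) / len(draws)
```

With these parameters, frac_a should be close to 0.3 and mean_x close to 0.3 * 2.0 + 0.7 * 3.0 = 2.7.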

² Following PRISM, we often omit the instance number in an msw
when a program uses only one instance from a family of random
processes.

Extended PRISM

Support for continuous variables is added by modifying PRISM's
language in two ways. We use the msw relation to sample from
discrete as well as continuous distributions. In PRISM, a special
relation called values is used to specify the ranges of values of
random variables; the probability mass functions are specified
using set_sw directives. We extend the set_sw directives to
specify probability density functions as well. For instance,
set_sw(r, norm(Mu, Var)) specifies that outcomes of random
process r have a Gaussian distribution with mean Mu and variance
Var. Parameterized families of random processes may be specified,
as long as the parameters are discrete-valued. For instance,
set_sw(w(M), norm(Mu, Var)) specifies a family of random
processes, with one for each value of M. As in PRISM, set_sw
directives may be specified programmatically; for instance, the
distribution parameters of w(M) may be computed as functions of
M.

Additionally, we extend PRISM programs with linear equality
constraints over reals. Without loss of generality, we assume
that constraints are written as linear equalities of the form
$Y = a_1 * X_1 + \ldots + a_n * X_n + b$ where the $a_i$ and $b$
are all floating-point constants. The use of constraints enables
us to encode Hybrid Bayesian Networks and Kalman Filters as
extended PRISM programs. In the following, we use Constr to
denote a set (conjunction) of linear equality constraints. We
also denote by $\bar{X}$ a vector of variables and/or values,
explicitly specifying the size only when it is not clear from the
context. This permits us to write linear equality constraints
compactly (e.g., $Y = \bar{a} \cdot \bar{X} + b$).

    G1 : fmix(X)
           |
    G2 : msw(m, M), msw(w(M), X).
           |
    G3 : msw(w(M), X).

Figure 1: Symbolic derivation for goal fmix(X)

³ The evaluation procedure is defined only when the independence
and exclusiveness assumptions hold.

Inference 

The key to inference in the presence of continuous ran- 
dom variables is avoiding enumeration by representing 
the derivations and their attributes symbolically. A sin- 
gle step in the construction of a symbolic derivation is 
defined below. 

Definition 1 (Symbolic Derivation) A goal $G$ directly derives
goal $G'$, denoted $G \to G'$, if:

PCR: $G = q_1(\bar{X}_1), G_1$, and there exists a clause in the
program, $q_1(\bar{Y}) \mathrel{:-} r_1(\bar{Y}_1), r_2(\bar{Y}_2),
\ldots, r_m(\bar{Y}_m)$, such that
$\theta = \mathrm{mgu}(q_1(\bar{X}_1), q_1(\bar{Y}))$; then
$G' = (r_1(\bar{Y}_1), r_2(\bar{Y}_2), \ldots, r_m(\bar{Y}_m), G_1)\theta$;

MSW: $G = \mathtt{msw}(rv(\bar{X}), Y), G_1$; then $G' = G_1$;

CONS: $G = Constr, G_1$ and $Constr$ is satisfiable; then
$G' = G_1$.

A symbolic derivation of $G$ is a sequence of goals
$G_0, G_1, \ldots$ such that $G = G_0$ and, for all $i \geq 0$,
$G_i \to G_{i+1}$.

Note that the traditional notion of derivation in a logic program
coincides with that of symbolic derivation when the selected
subgoal (literal) is not an msw or a constraint. When the
selected subgoal is an msw, PRISM's inference will construct the
next step by enumerating the values of the random variable. In
contrast, symbolic derivation skips msw's and constraints and
continues with the remaining subgoals in a goal. The effect of
these constructs is computed by associating (a) variable type
information and (b) a success function (defined below) with each
goal in the derivation. The symbolic derivation for the goal
fmix(X) over the program in Example 1 is shown in Fig. 1.
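
The three derivation rules can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the authors' implementation: it assumes that clause heads and calls share variable names, so the mgu in the PCR rule is the identity, and it omits the satisfiability check of the CONS rule. The names PROGRAM, derive_step and symbolic_derivation are hypothetical.

```python
# Program of Example 1 as a clause table: predicate name -> clause body.
# Assumption: heads and calls use the same variable names, so no
# unification is needed in this sketch.
PROGRAM = {"fmix": [("msw", "m", "M"), ("msw", "w(M)", "X")]}


def derive_step(goal):
    """Apply one rule to the leftmost subgoal.
    Returns (rule_name, next_goal), or None for the empty goal."""
    if not goal:
        return None
    first, rest = goal[0], goal[1:]
    if first[0] == "msw":
        return "MSW", rest            # skip the msw; outcomes are not enumerated
    if first[0] == "constr":
        return "CONS", rest           # satisfiability check omitted here
    return "PCR", list(PROGRAM[first[0]]) + rest   # program clause resolution


def symbolic_derivation(goal):
    """Return the goal sequence G_0, G_1, ... and the rules applied."""
    goals, rules = [list(goal)], []
    while True:
        step = derive_step(goals[-1])
        if step is None:
            return goals, rules
        rules.append(step[0])
        goals.append(step[1])


goals, rules = symbolic_derivation([("fmix", "X")])
```

For the goal fmix(X) this yields a single derivation whose non-empty goals match Fig. 1, regardless of how many values M can take.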

Success Functions: Goals in a symbolic derivation may contain
variables whose values are determined by msw's appearing
subsequently in the derivation. With each goal $G_i$ in a
symbolic derivation, we associate a set of variables, $V(G_i)$,
that is a subset of variables in $G_i$. The set $V(G_i)$ is such
that the variables in $V(G_i)$ subsequently appear as parameters
or outcomes of msw's in some subsequent goal $G_j$, $j \geq i$.
We can further partition $V$ into two disjoint sets, $V_c$ and
$V_d$, representing continuous and discrete variables,
respectively.

Given a goal $G_i$ in a symbolic derivation, we can associate
with it a success function, which is a function from the set of
all valuations of $V(G_i)$ to $[0,1]$. Intuitively, the success
function represents the probability that the symbolic derivation
represents a successful derivation for each valuation of
$V(G_i)$. Note that the success function computation uses a set
of distribution parameters $\Theta$. For simplicity, we often
omit it in the equations and mention it only when it is not clear
from the context.

Representation of success functions: Given a set of variables
$V$, let $\mathcal{C}$ denote the set of all linear equality
constraints over reals using $V$. Let $\mathcal{L}$ be the set of
all linear functions over $V$ with real coefficients. Let
$\mathcal{N}_X(\mu, \sigma^2)$ be the PDF of a univariate
Gaussian distribution with mean $\mu$ and variance $\sigma^2$,
and $\delta_x(X)$ be the Dirac delta function, which is zero
everywhere except at $x$ and whose integral over its entire range
is 1. Expressions of the form
$k * \prod_l \delta_v(V_l) \prod_i \mathcal{N}_{f_i}$, where $k$
is a non-negative real number and $f_i \in \mathcal{L}$, are
called product PDF (PPDF) functions over $V$. We use $\phi$
(possibly subscripted) to denote such functions. A pair
$\langle \phi, C \rangle$ where $C \subseteq \mathcal{C}$ is
called a constrained PPDF function. A sum of a finite number of
constrained PPDF functions is called a success function,
represented as $\sum_i \langle \phi_i, C_i \rangle$.

We use $C_i(\psi)$ to denote the constraints (i.e., $C_i$) in the
$i$-th constrained PPDF function of success function $\psi$; and
$D_i(\psi)$ to denote the $i$-th PPDF function of $\psi$.

Success functions of base predicates: The success function of a
constraint $C$ is $\langle 1, C \rangle$. The success function of
true is $\langle 1, true \rangle$. The PPDF component of
$\mathtt{msw}(rv(\bar{X}), Y)$'s success function is the
probability density function of rv's distribution if rv is
continuous, and its probability mass function if rv is discrete;
its constraint component is true.

Example 2 The success function of msw(m, M) for the program in
Example 1 is $\psi_1 = 0.3\,\delta_a(M) + 0.7\,\delta_b(M)$.

The success function of msw(w(M), X) for the program in Example 1
is $\psi_2 = \delta_a(M)\,\mathcal{N}_X(2.0, 1.0) +
\delta_b(M)\,\mathcal{N}_X(3.0, 1.0)$. □

Success functions of user-defined predicates: If $G \to G'$ is a
step in a derivation, then the success function of $G$ is
computed bottom-up based on the success function of $G'$. This
computation is done using join and marginalize operations on
success functions.

Definition 2 (Join) Let $\psi_1 = \sum_i \langle D_i, C_i \rangle$
and $\psi_2 = \sum_j \langle D_j, C_j \rangle$ be two success
functions; then the join of $\psi_1$ and $\psi_2$, represented as
$\psi_1 * \psi_2$, is the success function
$\sum_{i,j} \langle D_i D_j, C_i \wedge C_j \rangle$.

Given a success function $\psi$ for a goal $G$, the success
function for $\exists X.\, G$ is computed by the marginalization
operation. Marginalization w.r.t. a discrete variable is
straightforward and omitted. Below we define marginalization
w.r.t. continuous variables in two steps: first rewriting the
success function in a projected form and then doing the required
integration.

Projection eliminates any linear constraint on $V$, where $V$ is
the continuous variable to marginalize over. The projection
operation, denoted by $\psi \downarrow_V$, involves finding a
linear constraint (i.e., $V = \bar{a} \cdot \bar{X} + b$) on $V$
and replacing all occurrences of $V$ in the success function by
$\bar{a} \cdot \bar{X} + b$.
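
As a small worked illustration of projection, consider a success function with a single constrained PPDF term. The constraint $V = 2X + 1$ and its coefficients are made up for this example and do not come from the paper. Substituting the right-hand side for $V$ turns the constraint into a trivially true equality:

```latex
\psi = \langle \mathcal{N}_V(\mu, \sigma^2),\; V = 2X + 1 \rangle
\quad\Longrightarrow\quad
\psi \downarrow_V = \langle \mathcal{N}_{2X+1}(\mu, \sigma^2),\; \mathit{true} \rangle
```

The Gaussian is now written over the linear function $2X + 1$, which is the form that PPDF functions allow.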

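Proposition 1 Integration of a PPDF function with respect to a
variable $V$ is a PPDF function, i.e.,

    $$\alpha \int_{-\infty}^{\infty} \prod_{k=1}^{m} \mathcal{N}_{(\bar{a}_k \cdot \bar{X}_k + b_k)}(\mu_k, \sigma_k^2)\, dV = \alpha' \prod_{l=1}^{m'} \mathcal{N}_{(\bar{a}'_l \cdot \bar{X}'_l + b'_l)}(\mu'_l, \sigma'^2_l)$$

where $V \in \bar{X}_k$ and $V \notin \bar{X}'_l$.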
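
A familiar special case may help to see why the result stays in PPDF form. The instance below is chosen for illustration and is not taken from the paper: one Gaussian over $V$ and one over the linear function $X - V$. Integrating out $V$ is the convolution of two Gaussian densities, which is again a single Gaussian, now over $X$ alone:

```latex
\int_{-\infty}^{\infty}
  \mathcal{N}_{V}(\mu_1, \sigma_1^2)\,
  \mathcal{N}_{X - V}(\mu_2, \sigma_2^2)\, dV
= \mathcal{N}_{X}(\mu_1 + \mu_2,\; \sigma_1^2 + \sigma_2^2)
```

Here $V$ occurs in both factors on the left and in none on the right, and the constant factor is 1.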

Definition 3 (Integration) Let $\psi$ be a success function that
does not contain any linear constraints on $V$. Then integration
of $\psi$ with respect to $V$, denoted by $\oint_V \psi$, is a
success function $\psi'$ such that
$\forall i.\; D_i(\psi') = \int D_i(\psi)\, dV$.

Definition 4 (Marginalize) Marginalization of a success function
$\psi$ with respect to a variable $V$, denoted by
$\mathbb{M}(\psi, V)$, is a success function $\psi'$ such that

    $$\psi' = \oint_V \psi \downarrow_V$$

We overload $\mathbb{M}$ to denote marginalization over a set of
variables, defined such that
$\mathbb{M}(\psi, \{V\} \cup \bar{X}) = \mathbb{M}(\mathbb{M}(\psi, V), \bar{X})$
and $\mathbb{M}(\psi, \{\}) = \psi$.

The success function for a derivation is defined as 
follows. 

Definition 5 (Success function of a derivation) Let $G \to G'$.
Then the success function of $G$, denoted by $\psi_G$, is
computed from that of $G'$, based on the way $G'$ was derived:

PCR: $\psi_G = \mathbb{M}(\psi_{G'}, V(G') - V(G))$.

MSW: Let $G = \mathtt{msw}(rv(\bar{X}), Y), G_1$. Then
$\psi_G = \psi_{msw(rv(\bar{X}),Y)} * \psi_{G'}$.

CONS: Let $G = Constr, G_1$. Then
$\psi_G = \psi_{Constr} * \psi_{G'}$.

Note that the above definition carries PRISM's assump- 
tion that an instance of a random variable occurs at 
most once in any derivation. In particular, the PCR 
step marginalizes success functions w.r.t. a set of vari- 
ables; the valuations of the set of variables must be mu- 
tually exclusive for correctness of this step. The MSW 
step joins success functions; the goals joined must use 
independent random variables for the join operation to 
correctly compute success functions in this step. 



Example 3 Fig. 1 shows the symbolic derivation for goal fmix(X)
over the finite mixture model program in Example 1. The success
function of goal $G_3$ is $\psi_{msw(w(M),X)}(M, X)$, hence
$\psi_{G_3} = \delta_a(M)\,\mathcal{N}_X(\mu_a, \sigma_a^2) +
\delta_b(M)\,\mathcal{N}_X(\mu_b, \sigma_b^2)$.

$\psi_{G_2}$ is $\psi_{msw(m,M)}(M) * \psi_{G_3}(M, X)$, which
yields $\psi_{G_2} = p_a \delta_a(M)\,\mathcal{N}_X(\mu_a, \sigma_a^2) +
p_b \delta_b(M)\,\mathcal{N}_X(\mu_b, \sigma_b^2)$. Note that
$\delta_b(M)\delta_a(M) = 0$, as $M$ cannot be both $a$ and $b$
at the same time. Also $\delta_a(M)\delta_a(M) = \delta_a(M)$.

Finally, $\psi_{G_1} = \mathbb{M}(\psi_{G_2}, M)$, which is
$p_a \mathcal{N}_X(\mu_a, \sigma_a^2) + p_b \mathcal{N}_X(\mu_b, \sigma_b^2)$.
Note that $\psi_{G_1}$ represents the mixture distribution
(McLachlan and Peel 2001) of two Gaussian distributions.

Here $p_a = 0.3$, $p_b = 0.7$, $\mu_a = 2.0$, $\mu_b = 3.0$, and
$\sigma_a^2 = \sigma_b^2 = 1.0$. □
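
The computation in Example 3 can be replayed numerically. The Python sketch below is an illustration under simplifying assumptions, not the paper's implementation: constraints are omitted, only discrete marginalization is shown, and a success function is represented as a list of terms (k, deltas, gaussians). All function names are hypothetical.

```python
import math


def gauss_pdf(x, mean, var):
    """Density of a univariate Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


# A term (k, deltas, gaussians) stands for
#   k * prod_V delta_v(V) * prod N_X(mean, var)
# with deltas = {V: v} and gaussians = [(X, mean, var), ...].
def join(psi1, psi2):
    out = []
    for k1, d1, g1 in psi1:
        for k2, d2, g2 in psi2:
            # delta_a(M) * delta_b(M) = 0 for a != b, so drop such products;
            # delta_a(M) * delta_a(M) = delta_a(M), handled by the dict merge.
            if any(v in d2 and d2[v] != val for v, val in d1.items()):
                continue
            out.append((k1 * k2, {**d1, **d2}, g1 + g2))
    return out


def marginalize_discrete(psi, var):
    """Sum over all values of a discrete variable.
    Assumes every term carries a delta on var (true in this example)."""
    out = []
    for k, d, g in psi:
        assert var in d, "sketch assumes every term fixes var"
        out.append((k, {v: val for v, val in d.items() if v != var}, g))
    return out


def evaluate(psi, values):
    """Evaluate a success function at a valuation of its variables."""
    total = 0.0
    for k, d, g in psi:
        if all(values.get(v) == val for v, val in d.items()):
            total += k * math.prod(gauss_pdf(values[x], m, s2) for x, m, s2 in g)
    return total


psi_msw_m = [(0.3, {"M": "a"}, []), (0.7, {"M": "b"}, [])]
psi_g3 = [(1.0, {"M": "a"}, [("X", 2.0, 1.0)]),
          (1.0, {"M": "b"}, [("X", 3.0, 1.0)])]
psi_g2 = join(psi_msw_m, psi_g3)               # MSW step
psi_g1 = marginalize_discrete(psi_g2, "M")     # PCR step
```

Evaluating psi_g1 at a value of X gives the mixture density 0.3 N(X; 2.0, 1.0) + 0.7 N(X; 3.0, 1.0), and psi_g2 has two terms rather than four because the cross terms vanish.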

Note that for a program with only discrete random variables,
there may be exponentially fewer symbolic derivations than
concrete derivations à la PRISM. The compactness is only in terms
of the number of derivations and not the total size of the
representations. In fact, for programs with only discrete random
variables, there is a one-to-one correspondence between the
entries in the tabular representation of success functions and
PRISM's answer tables. For such programs, it is easy to show that
the time complexity of the inference presented in this paper is
the same as that of PRISM.

Learning 

We use the expectation-maximization algorithm (Dempster, Laird,
and Rubin 1977) to learn the distribution parameters from data.
First we show how to compute the expected sufficient statistics
(ESS) of the random variables and then describe our algorithm.

The ESS of a discrete random variable is an n-tuple where n is
the number of values that the discrete variable takes. Suppose
that a discrete random variable $V$ takes $v_1, v_2, \ldots, v_n$
as values. Then the ESS of $V$ is
$(ESS^{V=v_1}, ESS^{V=v_2}, \ldots, ESS^{V=v_n})$ where
$ESS^{V=v_i}$ is the expected number of times variable $V$ had
valuation $v_i$ in all possible proofs for a goal. The ESS of a
Gaussian random variable $X$ is a triple
$(ESS^{X,\mu}, ESS^{X,\sigma^2}, ESS^{X,count})$ where the
components denote the expected sum, expected sum of squares and
the expected number of uses of random variable $X$, respectively,
in all possible proofs of a goal. When derivations are
enumerated, the ESS for each random variable can be represented
by a tuple of reals. To accommodate symbolic derivations, we lift
each component of ESS to a function, represented as described
below.

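Representation of ESS functions: For each component $\nu$
(discrete variable valuation, mean, variance, total counts) of a
random variable, its ESS function in a goal $G$ is represented as
follows:

    $$\xi^{\nu}_{G} = \sum_i \langle \chi_i \phi_i, C_i \rangle$$

where $\langle \phi_i, C_i \rangle$ is a constrained PPDF
function and

    $$\chi_i = \begin{cases}
    \bar{a}_i \cdot \bar{X}_i + b_i & \text{if } \nu = X, \mu \\
    \bar{a}_i \cdot \bar{X}_i^2 + b_i & \text{if } \nu = X, \sigma^2 \\
    b_i & \text{otherwise}
    \end{cases}$$

Here $\bar{a}_i$, $b_i$ are constants, and $\bar{X}_i = V_c(G)$.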

Note that the representation of an ESS function is the same as
that of a success function for discrete random variable
valuations and total counts. Join and Marginalize operations,
defined earlier for success functions, can be readily defined for
ESS functions as well. The computation of ESS functions for a
goal, based on the symbolic derivation, uses the extended join
and marginalize operations. The set of all ESS functions is
closed under the extended Join and Marginalize operations.

ESS functions of base predicates: The ESS function of the $i$-th
parameter of a discrete random variable $V$ is
$P(V = v_i)\,\delta_{v_i}(V)$. The ESS function of the mean of a
continuous random variable $X$ is $X \mathcal{N}_X(\mu, \sigma^2)$,
and the ESS function of the variance of a continuous random
variable $X$ is $X^2 \mathcal{N}_X(\mu, \sigma^2)$. Finally, the
ESS function of the total count of a continuous random variable
$X$ is $\mathcal{N}_X(\mu, \sigma^2)$.
Example 4 In this example, we compute the ESS functions of the
random variables (m, w(a), and w(b)) in Example 1. According to
the definition of ESS functions of base predicates, the ESS
functions of these random variables for goals msw(m, M) and
msw(w(M), X) are

    ESS                   for msw(m, M)        for msw(w(M), X)
    $\xi^{k}$             $p_k \delta_k(M)$    $0$
    $\xi^{\mu_k}$         $0$                  $X \delta_k(M) \mathcal{N}_X(\mu_k, \sigma_k^2)$
    $\xi^{\sigma_k^2}$    $0$                  $X^2 \delta_k(M) \mathcal{N}_X(\mu_k, \sigma_k^2)$
    $\xi^{count_k}$       $0$                  $\delta_k(M) \mathcal{N}_X(\mu_k, \sigma_k^2)$

where $k \in \{a, b\}$.

ESS functions of user-defined predicates: If $G \to G'$ is a step
in a derivation, then the ESS function of a random variable for
$G$ is computed bottom-up based on its ESS function for $G'$.

The ESS function of a random variable component in a derivation
is defined as follows.

Definition 6 (ESS functions in a derivation) Let $G \to G'$. Then
the ESS function of a random variable component $\nu$ in the goal
$G$, denoted by $\xi^{\nu}_{G}$, is computed from that of $G'$,
based on the way $G'$ was derived:

PCR: $\xi^{\nu}_{G} = \mathbb{M}(\xi^{\nu}_{G'}, V(G') - V(G))$.

MSW: Let $G = \mathtt{msw}(rv(\bar{X}), Y), G_1$. Then
$\xi^{\nu}_{G} = \psi_{msw(rv(\bar{X}),Y)} * \xi^{\nu}_{G'} +
\psi_{G'} * \xi^{\nu}_{msw(rv(\bar{X}),Y)}$.

CONS: Let $G = Constr, G_1$. Then
$\xi^{\nu}_{G} = \psi_{Constr} * \xi^{\nu}_{G'}$.

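Example 5 Using the definition of the ESS function of a
derivation involving MSW, we compute the ESS functions of the
random variables in goal $G_2$ of Fig. 1.

    ESS functions for goal $G_2$
    $\xi^{k}$             $p_k \delta_k(M) \mathcal{N}_X(\mu_k, \sigma_k^2)$
    $\xi^{\mu_k}$         $X p_k \delta_k(M) \mathcal{N}_X(\mu_k, \sigma_k^2)$
    $\xi^{\sigma_k^2}$    $X^2 p_k \delta_k(M) \mathcal{N}_X(\mu_k, \sigma_k^2)$
    $\xi^{count_k}$       $p_k \delta_k(M) \mathcal{N}_X(\mu_k, \sigma_k^2)$

Notice the way $\xi^{k}_{G_2}$ is computed:

    $$\xi^{k}_{G_2} = \psi_{msw(m,M)}\, \xi^{k}_{G_3} + \psi_{G_3}\, \xi^{k}_{msw(m,M)}$$
    $$= [p_a \delta_a(M) + p_b \delta_b(M)] \cdot 0 + [\delta_a(M) \mathcal{N}_X(\mu_a, \sigma_a^2) + \delta_b(M) \mathcal{N}_X(\mu_b, \sigma_b^2)] \cdot p_k \delta_k(M)$$
    $$= p_k \delta_k(M) \mathcal{N}_X(\mu_k, \sigma_k^2)$$

Finally, for goal $G_1$ we marginalize the ESS functions w.r.t.
$M$.

    ESS functions for goal $G_1$
    $\xi^{k}$             $p_k \mathcal{N}_X(\mu_k, \sigma_k^2)$
    $\xi^{\mu_k}$         $X p_k \mathcal{N}_X(\mu_k, \sigma_k^2)$
    $\xi^{\sigma_k^2}$    $X^2 p_k \mathcal{N}_X(\mu_k, \sigma_k^2)$
    $\xi^{count_k}$       $p_k \mathcal{N}_X(\mu_k, \sigma_k^2)$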

The algorithm for learning distribution parameters ($\Theta$)
uses a fixed set of training examples ($t_1, t_2, \ldots, t_N$).
Note that the success and ESS functions for the $t_i$'s are
constants, as the training examples are variable free (i.e., all
the variables get marginalized over).

Algorithm 1 (Expectation-Maximization)

Initialize the distribution parameters $\Theta$.

1. Construct the symbolic derivations for $\psi$ and $\xi$ using
current $\Theta$.

2. E-step: For each training example $t_i$ ($1 \leq i \leq N$),
compute the ESS ($\xi_{t_i}$) of the random variables, and
success probabilities $\psi_{t_i}$ w.r.t. $\Theta$.

M-step: Compute the MLE of the distribution parameters given the
ESS and success probabilities (i.e., evaluate $\Theta'$).
$\Theta'$ contains updated distribution parameters
($p', \mu', \sigma'^2$). More specifically, for a discrete random
variable $V$, its parameters are updated as follows:

    $$p'_{V=v} = \frac{\eta_{V=v}}{\sum_{u \in values(V)} \eta_{V=u}}$$

where

    $$\eta_{V=v} = \sum_{i=1}^{N} \frac{\xi^{V=v}_{t_i}}{\psi_{t_i}}$$

For each continuous random variable $X$, its mean and variance
are updated as follows:

    $$\mu'_X = \frac{\sum_{i=1}^{N} \xi^{X,\mu}_{t_i} / \psi_{t_i}}{N_X}$$

    $$\sigma'^2_X = \frac{\sum_{i=1}^{N} \xi^{X,\sigma^2}_{t_i} / \psi_{t_i}}{N_X} - \mu'^2_X$$

where $N_X$ is the expected total count of $X$:

    $$N_X = \sum_{i=1}^{N} \frac{\xi^{X,count}_{t_i}}{\psi_{t_i}}$$

3. Evaluate the log likelihood
($\ln P(t_1, \ldots, t_N \mid \Theta') = \sum_i \ln \psi_{t_i}$)
and check for convergence. Otherwise let
$\Theta \leftarrow \Theta'$ and return to step 1.
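
For the finite mixture program of Example 1, the E-step and M-step reduce to a few sums. The Python sketch below is a minimal illustration under that assumption, not a general implementation of the algorithm: the ESS values for a training example fmix(x) are written out in closed form instead of being obtained from symbolic derivations, and the names em_step and fit are hypothetical.

```python
import math
import random


def gauss_pdf(x, mean, var):
    """Density of a univariate Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_step(xs, p, mu, var):
    """One E-step and M-step for fmix; p, mu, var are dicts keyed by component.
    Returns updated parameters and the log likelihood under the OLD parameters."""
    eta = {k: 0.0 for k in p}      # sum_i xi^{k} / psi            (also N_k here)
    s_mu = {k: 0.0 for k in p}     # sum_i xi^{mu_k} / psi
    s_sq = {k: 0.0 for k in p}     # sum_i xi^{sigma_k^2} / psi
    loglik = 0.0
    for x in xs:
        xi = {k: p[k] * gauss_pdf(x, mu[k], var[k]) for k in p}  # xi^{k} for fmix(x)
        psi = sum(xi.values())                                   # success probability
        loglik += math.log(psi)
        for k in p:
            ratio = xi[k] / psi
            eta[k] += ratio
            s_mu[k] += x * ratio
            s_sq[k] += x * x * ratio
    total = sum(eta.values())
    new_p = {k: eta[k] / total for k in p}
    new_mu = {k: s_mu[k] / eta[k] for k in p}
    new_var = {k: s_sq[k] / eta[k] - new_mu[k] ** 2 for k in p}
    return new_p, new_mu, new_var, loglik


def fit(xs, p, mu, var, iters=50, tol=1e-8):
    """Iterate em_step until the log likelihood changes by less than tol."""
    history = []
    for _ in range(iters):
        p, mu, var, loglik = em_step(xs, p, mu, var)
        converged = bool(history) and abs(loglik - history[-1]) < tol
        history.append(loglik)
        if converged:
            break
    return p, mu, var, history
```

A call such as fit(xs, {"a": 0.5, "b": 0.5}, {"a": 1.0, "b": 4.0}, {"a": 1.0, "b": 1.0}) returns the learned parameters together with the log-likelihood trace, which should be non-decreasing. Note that this sketch evaluates the log likelihood at the start of each iteration rather than after the update.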

Theorem 2 Algorithm 1 correctly computes the MLE which (locally)
maximizes the likelihood.

(Proof) Sketch. The main routine of Algorithm 1 for the discrete
case is the same as the learn-naive algorithm of Sato and Kameya
(1999), except for the computation of $\eta_{V=v}$:

    $$\eta_{V=v} = \sum_{g} \frac{1}{\psi_g} \sum_{S} P(S) N^{v}_{S}$$

where the outer sum ranges over each goal $g$, $S$ is an
explanation for goal $g$ and $N^{v}_{S}$ is the total number of
times $V = v$ in $S$.

We show that $\xi^{V=v}_{g} = \sum_S P(S) N^{v}_{S}$.

Let the goal $g$ have a single explanation $S$ where $S$ is a
conjunction of subgoals (i.e., $S_{1:n} = g_1, g_2, \ldots, g_n$).
Thus we need to show that $\xi^{V=v}_{g} = P(S) N^{v}_{S}$.

We prove this by induction on the length of $S$. The definition
of $\xi$ for base predicates gives the desired result for
$n = 1$. Let the above equation hold for length $n$, i.e.,
$\xi^{V=v}_{g_{1:n}} = P(S_{1:n}) N_{1:n}$. For
$S_{1:n+1} = g_1, g_2, \ldots, g_n, g_{n+1}$,

    $$P(S_{1:n+1}) N_{1:n+1} = P(g_1, g_2, \ldots, g_n, g_{n+1}) N_{1:n+1}$$
    $$= P(g_1, g_2, \ldots, g_n) P(g_{n+1}) (N_{1:n} + N_{n+1})$$
    $$= P(g_{n+1}) [P(S_{1:n}) N_{1:n}] + P(S_{1:n}) [P(g_{n+1}) N_{n+1}]$$
    $$= P(g_{n+1})\, \xi^{V=v}_{g_{1:n}} + P(S_{1:n})\, \xi^{V=v}_{g_{n+1}}$$

The last step follows from the definition of $\xi$ in a
derivation.

Now based on the exclusiveness assumption, for disjunction (or
multiple explanations) like $g = g_1 \vee g_2$ it trivially
follows that $\xi^{V=v}_{g} = \xi^{V=v}_{g_1} + \xi^{V=v}_{g_2}$.

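Example 6 Let $x_1, x_2, \ldots, x_N$ be the observations. For a
given training example $t_i = fmix(x_i)$, the ESS functions are

    ESS functions for goal $fmix(x_i)$
    $\xi^{k}$             $p_k \mathcal{N}_X(x_i \mid \mu_k, \sigma_k^2)$
    $\xi^{\mu_k}$         $x_i p_k \mathcal{N}_X(x_i \mid \mu_k, \sigma_k^2)$
    $\xi^{\sigma_k^2}$    $x_i^2 p_k \mathcal{N}_X(x_i \mid \mu_k, \sigma_k^2)$
    $\xi^{count_k}$       $p_k \mathcal{N}_X(x_i \mid \mu_k, \sigma_k^2)$

The E-step of the EM algorithm involves computation of the above
ESS functions.

In the M-step, we update the model parameters from the computed
ESS functions.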

Example 7 This example illustrates that for the mixture model
example, our ESS computation does the same computation as the
standard EM learning algorithm for mixture models (Bishop 2006).

Notice that in the M-step update of the mixing probabilities,

    $$\frac{\xi^{k}_{t_i}}{\psi_{t_i}} = \frac{p_k \mathcal{N}_X(x_i \mid \mu_k, \sigma_k^2)}{\sum_l p_l \mathcal{N}_X(x_i \mid \mu_l, \sigma_l^2)}$$

which is nothing but the posterior responsibilities presented in
(Bishop 2006). The same ratio appears in the update of the means.

Variances are updated similarly.
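
To make the correspondence concrete, the M-step of Algorithm 1 can be written out for the mixture model in terms of the responsibilities. The symbols $\gamma_{ik}$ and $N_k$ are shorthand introduced here for illustration; the derivation only substitutes the ESS functions of Example 6 into the update rules and uses $\sum_k \gamma_{ik} = 1$:

```latex
\gamma_{ik} = \frac{\xi^{k}_{t_i}}{\psi_{t_i}}
            = \frac{p_k\, \mathcal{N}_X(x_i \mid \mu_k, \sigma_k^2)}
                   {\sum_{l} p_l\, \mathcal{N}_X(x_i \mid \mu_l, \sigma_l^2)},
\qquad
N_k = \sum_{i=1}^{N} \gamma_{ik}
\\[1ex]
p'_k = \frac{N_k}{\sum_{u} N_u} = \frac{N_k}{N},
\qquad
\mu'_k = \frac{1}{N_k} \sum_{i=1}^{N} \gamma_{ik}\, x_i,
\qquad
\sigma'^2_k = \frac{1}{N_k} \sum_{i=1}^{N} \gamma_{ik}\, x_i^2 - \mu'^2_k
```

These are the usual EM updates for a Gaussian mixture, with the variance written as the weighted second moment minus the squared mean.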

Discussion and Concluding Remarks

The symbolic inference and learning procedures enable us to
reason over a large class of statistical models such as hybrid
Bayesian networks with discrete child-discrete parent, continuous
child-discrete parent (finite mixture model), and continuous
child-continuous parent (Kalman filter), which was hitherto not
possible in PLP frameworks. They can also be used for hybrid
models, e.g., models that mix discrete and Gaussian
distributions. For instance, consider the mixture model example
(Example 1) where w(a) is Gaussian but w(b) is a discrete
distribution with values 1 and 2 with 0.5 probability each. The
density of the mixture distribution can be written as

    $$f(X) = 0.3\,\mathcal{N}_X(2.0, 1.0) + 0.35\,\delta_{1.0}(X) + 0.35\,\delta_{2.0}(X)$$

Thus the language can be used to model problems that lie outside
traditional hybrid Bayesian networks.

ProbLog and LPAD do not impose PRISM's mutual 
exclusion and independence restrictions. Their infer- 
ence technique first materializes the set of explanations 
for each query, and represents this set as a BDD, where 
each node in the BDD is a (discrete) random variable. 
Distinct paths in the BDD are mutually exclusive and
variables in a single path are all independent. Probabilities
of query answers are computed trivially based on
this BDD representation. The technical development in 
this paper is limited to PRISM and imposes its restric- 
tions. However, by materializing the set of symbolic 
derivations first, representing them in a factored form 
(such as a BDD) and then computing success functions 
on this representation, we can readily lift the restric- 
tions for the parameter learning technique. 

This paper considered only univariate Gaussian dis- 
tributions. Traditional parameter learning techniques 
have been described for multivariate distributions with- 
out introducing additional machinery. Extending our 
learning algorithm to the multivariate case is a topic of 
future work. 